Module 1
University of South Florida
Write once, run everywhere: dplyr syntax that can be used in:
arrow for fast out-of-core data
dbplyr for databases (duckdb, SQLite, etc.)
sparklyr for spark integration
duckplyr for duckdb backend
and others (dtplyr, tidypolars)
A path is an address of a file or folder
Types of paths:
Begins from the root directory (/ in Mac/Linux, C:\ in Windows)
/Users/username/Documents/project/data.csv (Mac/Linux)
C:\Users\username\Documents\project\data.csv (Windows)
Unambiguous
Not dependent on the current working directory
OS dependent (Windows vs macOS & Linux)
The location relative to the current working directory.
data/project/data.csv
Portable across systems
Easier to manage in projects
The working directory is the folder where R reads and writes files by default
getwd() shows the current working directory
setwd("/path/to/directory") changes it
Windows: the <username>\Documents folder
C:\Users\<username>\Documents (john's home directory would be C:\Users\john\Documents)
Mac / Linux: the <username> folder
/Users/<username> (e.g., /Users/john)
~ represents the user's home directory.
In R, ~ refers to:
C:\Users\<username>\Documents on Windows
/Users/<username> on Mac/Linux
Example: ~/cases refers to:
C:\Users\<username>\Documents\cases on Windows
/Users/john/cases on Mac/Linux
. refers to the current directory.
If the working directory is /Users/john/projects, ./data refers to /Users/john/projects/data
.. refers to the parent directory, one level up from the current directory.
If the working directory is /Users/john/projects, ../data refers to /Users/john/data
You can use R to do basic math calculations.
To concatenate multiple elements into a vector, use the c() function:
Tip
An (atomic) vector in R is an ordered collection of elements, all of the same type.
R performs calculations on whole vectors at once (vectorized operations):
[1] 3 6 9
[1] 15 18 21 24
Tip
Vectorized operations are much faster than looping over each element.
In R, most operations are automatically vectorized.
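The outputs shown above can be produced by element-wise arithmetic like the following sketch (the exact input vectors used in the slides are an assumption):

```r
# Vectorized arithmetic: the operation applies to every element at once
x <- c(1, 2, 3)
x * 3              # [1] 3 6 9

y <- c(5, 6, 7, 8)
y * 3              # [1] 15 18 21 24
```

No explicit loop is needed; R recycles the scalar 3 across every element.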
Use <- to assign a value to a symbol (name).
Here, my_number is called symbol, or name of an object.
Tip
Style guide: though you still can use =, use <- for assignment.
Use = for specifying function arguments instead.
Some IDEs (e.g., RStudio / Positron / VS Code) have Alt (Option) + - as a shortcut for <-.
R has strict rules about a syntactic name (symbol).
You can’t use reserved words like TRUE, NULL, if, etc.
To deliberately use a non-syntactic name, wrap it in backticks (`).
Since names cannot contain whitespace, there are two popular naming conventions.
snake_case
camelCase
Tip
It is better to make a short, self-explanatory name.
e.g. weight <- 15 is easier to understand than my_variable_quantity <- 15
On the console: > means "waiting for your command"
+ means "continue the previous command"
Hit Ctrl + C (or Esc in RStudio) to abort
Vector types: the common data types
Special types: non-vector objects (e.g., NULL)
Vectors are the most important family of data types in R.
Vector is a data structure that stores multiple elements. It comes in two flavors:
NULL is not a vector, but often serves as zero length vector.
There are four primary types of atomic vector in R, and two others.
TRUE, FALSE, NA
Attach L to treat a number as a strict integer (e.g., 3L).
Caution
“numeric” is a collective term for both double and integers but often used as if it were a synonym for “double” or “real number” in practice.
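A quick check of the four primary atomic types with typeof():

```r
typeof(3.14)    # "double"
typeof(3L)      # "integer"  (the L suffix makes a strict integer)
typeof(TRUE)    # "logical"
typeof("usd")   # "character"
typeof(c(1, 2, 3))   # "double" -- plain numbers default to double
```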
" or '[1] "character"
[1] "character"
Tip
Style guide: Use double quote " for character instead of ' if possible.
raw type: binary data type
complex type: complex numbers (e.g. 3 + 4i)
rarely needed in Finance
NA represents a missing value
NULL is an intentional empty "placeholder" in R
What are the four primary types of atomic vector?
What are the types of a,b,c,d below?
Confirm your answer with typeof().
List is a generic vector that is not atomic.
An atomic vector can hold only one type of element (double, integer, logical, …)
A list can hold multiple data types among its members (even other lists)
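A minimal sketch contrasting atomic vectors and lists:

```r
atomic <- c(1, "a", TRUE)    # coerced to a single type: character
typeof(atomic)               # "character"

mixed <- list(1.5, "a", TRUE, list(1, 2))   # each element keeps its type
typeof(mixed)                # "list"
```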
The class attribute in R is used to define how objects behave with functions.
Especially important classes in Finance are:
Very important class in Finance.
They are built from double-type atomic vectors (type), but have their own specific rules of use (class).
Check their data type:
Check their attributes: they have class attributes.
To directly access the class attribute:
For an example, see how it works with + function.
[1] "2026-02-24"
[1] "2026-02-23 11:38:59 EST"
Q: Why does + 1 yield different results?
A: Because they are in different classes, so + 1 is interpreted differently.
The class attribute gives context for how an object should behave with functions.
Deep down, they are just numbers:
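A sketch of how a Date is a double underneath, with its behavior governed by the class attribute:

```r
d <- as.Date("2026-02-24")
class(d)       # "Date"
typeof(d)      # "double"
unclass(d)     # the bare number: days since 1970-01-01
d + 1          # "2026-02-25" -- for the Date class, + 1 means "one day later"
```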
Compare typeof(c(1,2,3)) and typeof(c(1L, 2L, 3L)). What's the difference?
Create my_first_object and MySecondObject. Add my_first_object with MySecondObject. What do you get?
Assign a value to my_name. What happens with my_name + 3? Why do you get this result?
Since class determines the behavior of an object, it is crucial to know your data's class, especially when performing function calls.
Note
Calling a function means executing/applying a function.
For example, you cannot use an add function on a character and a numeric.
You can define custom function (User-defined function) in R with the following syntax:
The function can be called in prefix form:
Our first function, c(), concatenates all the values and generates a single object!
seq function: generates a sequence of numbers.
Key arguments: from, to, and by
All R functions are built by someone, and documentation is typically provided.
For detailed description of any function, use ? followed by the function’s name.
For example, try below code in your console:
Or, use help()
The present value of a perpetuity, where the cash flow grows at a constant rate g, is given by:
\[ PV_{PER} = \frac{PMT}{r - g} \]
where
This formula applies when r > g.
You can design your perpetuity function in R with following syntax:
Let’s call the function above:
Define a perpetuity calculator function, pv_per(). What is the pv when PMT = $50,000, r = 4%, g = 0%?
What is the pv when PMT = $50,000, r = 4%, but g are 1%, 2%, 3%?
What happens if user doesn’t specify one argument?
You can set default values for arguments, allowing them to be omitted when calling the function.
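For instance, a hypothetical pv_per() with a default growth rate g = 0 (the argument names are assumptions):

```r
pv_per <- function(PMT, r, g = 0) {
  PMT / (r - g)   # growing perpetuity: PV = PMT / (r - g), valid for r > g
}

pv_per(50000, 0.04)         # g omitted: the default 0 is used (~1,250,000)
pv_per(50000, 0.04, 0.02)   # default overridden (~2,500,000)
```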
Functions are typically named so they can be reused multiple times.
However, you can skip naming a custom function; such functions are called anonymous functions.
They are not stored as objects since they do not have assigned symbols (names).
Note
Syntactic sugar refers to a feature in programming that makes code simpler to read or write, without adding new functionality.
(Anonymous) functions can be defined with syntactic sugar:
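Since R 4.1, \(x) is shorthand sugar for function(x):

```r
square <- function(x) x^2      # a named function
(\(x) x^2)(4)                  # an anonymous function, called immediately: 16
sapply(1:3, \(x) x^2)          # [1] 1 4 9
```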
Convert below perpetuity function (pv_per) to anonymous function:
A Motivating example
Solve below math problem. Describe your steps. What was the first and the last step?
\[ \sqrt{(2+4)^2 - 3 * 4} = ? \]
\[ \sqrt{(2+4)^2 - 3 * 4} \]
First, add 2 and 4, and then square it, and save it in your memory
and then compute 3 * 4 and subtract it from the previous value, and update your memory
and then take the square root of the value
Similarly, code can be written not in the order we calculate.
It is easier for us to read & write code in the order it is operated.
When we have nested (composite) function calls such as f(g(h(k(x)))),
The call sequence is x -> k() -> h() -> g() -> f().
It is rather easier to read, write and debug if we can write a code like:
This is where pipe operator |> becomes handy in R.
The pipe operator does “and then” job, and it can be written as:
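A minimal sketch of the pipe as "and then":

```r
x <- c(1, 4, 9)
# "take x, and then square-root it, and then sum it"
x |> sqrt() |> sum()   # same as sum(sqrt(x)): 6
```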
Tip
Style guide: use |> instead of %>%. Use shortcut Cmd (Ctrl) + Shift + M.
Sometimes you'll see the %>% operator instead, which comes from an external R package (magrittr), while |> is native to R. To use %>%, the package must be loaded with library(magrittr).
Solve \(\sqrt{2^3}\) using pipe operator.
Hint: use the sqrt() function, and define a function cube that computes x^3.
Packages are add-on libraries that extend the functionality of R.
Installing packages: install.packages()
Loading packages: library(). You need to load a package to use its functionality.
Control structure dictates which code gets executed and when.
if statements: execute code if a condition is true.
else / else if statements: execute alternative code if the condition is false.
for loops: repeat a code block a specified number of times.
while loops: continue executing code as long as a condition is true.
The basic form of if and if-else statements in R:
Example 1: if and else executes code based on logical conditions.
Example 2: If condition is not met, then nothing happens (skipped).
Example 3: else if checks one more logic condition:
Example 4: There can be multiple else if
Example 5: else is executed when all of if conditions are not met.
[1] "The stock price has decreased."
Less than / greater than (< or >)
Less than or equal / greater than or equal (<= or >=)
Equality (==) and inequality (!=)
Negation (!)
Logical AND (&), OR (|)
Write an if-else statement:
(Hint: update with PMT <- PMT + 10000)
What is the outcome of the above if-else if the initial PMT was 750?
For loops are used when code has to be iterated a specified number of times.
If the for loop were explicitly written:
To use number index for each element: use seq_along() in the loop.
x itself can be an iterable:
Generally used with if-else condition tests inside loop.
next is used to skip an iteration of loop.
break is used to exit loop immediately.
How to calculate compound interest over multiple years using a for loop?
[1] 10500.00 11025.00 11576.25 12155.06 12762.82 13400.96 14071.00 14774.55
[9] 15513.28 16288.95
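The output above is consistent with a loop like the following sketch ($10,000 principal at a 5% rate over 10 years; these inputs are inferred from the numbers shown):

```r
principal <- 10000
rate <- 0.05
values <- numeric(10)            # pre-allocate the result vector
for (year in 1:10) {
  values[year] <- principal * (1 + rate)^year   # compound growth to this year
}
round(values, 2)
```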
Based on the previous example, do the following:
Q1. Skip the first year using if and next
Q2. Stop the calculation if value exceeds $14,000
Write a function that checks class of an input.
If the input is numeric, print “Numeric input!”, otherwise, print “Not numeric!”
Hint: use inherits(x, "numeric") for the logical test.
map or apply is an implicit function loop.
f is an input argument for map()
map() requires the tidyverse or purrr package
map() replaces explicit for loops
If the desired output is not a list but an atomic vector:
map_dbl() returns a numeric (double) vector
map_chr() returns a character vector
map_lgl() returns a logical vector
map_int() returns an integer vector
Write a times_two() function that multiplies its input by 2.
map times_two function over 1:10
Achieve same result with anonymous function instead.
Achieve same result with for loop.
Convert below for loop operation to map()
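A general sketch of the for-to-map conversion (assuming purrr is installed; this is an illustration, not the exercise answer):

```r
library(purrr)

times_two <- function(x) x * 2

# Explicit for loop
out <- numeric(10)
for (i in seq_along(1:10)) {
  out[i] <- times_two(i)
}

# Equivalent implicit loop with map_dbl()
out_map <- map_dbl(1:10, times_two)
identical(out, out_map)   # TRUE
```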
The data.frame class
One of the most important data classes in R, built on top of the list type
Stores data structure in 2D tabular form:
with rows (observations, or records)
and columns (variables)
columns can be different types!
# A tibble: 3 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
Create a dataframe named as housing:
What are the type (class) of each column automatically recognized by R?
Hint: check with str(housing).
Q: What should their types (classes) be in theory?
dplyr / tidyverse
Key verbs
Chain verbs with the pipe (%>% or |>)
select(): select a subset of columns
rename(): rename columns
relocate(): change column positions
filter(): select a subset of rows with conditions
arrange(): reorder rows
mutate(): add new columns (variables)
summarize(): generate a summary table, often based on group_by()
  Sepal.Length Petal.Width Species
1 5.1 0.2 setosa
2 4.9 0.2 setosa
3 4.7 0.2 setosa
Select from “Sepal.Length” (1st) to “Petal.Length” (3rd) column
Sepal.Length Sepal.Width Petal.Length
1 5.1 3.5 1.4
2 4.9 3.0 1.4
3 4.7 3.2 1.3
Sepal.Length Sepal.Width Petal.Length
1 5.1 3.5 1.4
2 4.9 3.0 1.4
3 4.7 3.2 1.3
Select columns except for specified columns
Convenience functions are allowed within select().
Examples:
starts_with(), ends_with(), contains()
…and lots more!
matches(), num_range(), all_of(), any_of(), everything(), last_col(), where()
Check tidyselect documentation
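A sketch of select() with tidyselect helpers on the built-in iris data (assuming dplyr is loaded):

```r
library(dplyr)

iris |> select(Sepal.Length, Petal.Width, Species) |> head(3)
iris |> select(Sepal.Length:Petal.Length) |> head(3)    # a name range
iris |> select(starts_with("Sepal")) |> head(3)         # a helper function
iris |> select(-Species) |> head(3)                     # drop a column
```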
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
Sepal.Length Petal.Length
1 5.1 1.4
2 4.9 1.4
3 4.7 1.3
Rearrange the column order.
Non-mentioned columns retain original ordering.
Example 1: Subset rows with conditions
# A tibble: 3 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
2 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
3 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Example 2: Multiple conditions
# A tibble: 3 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.26 Good D VS2 65.2 56 403 3.99 4.02 2.61
2 0.26 Good D VS1 58.4 63 403 4.19 4.24 2.46
3 0.26 Good E VVS1 57.9 60 554 4.22 4.25 2.45
See also slice() for subsetting rows by position.
From the diamonds dataframe (load tidyverse for access),
Subset the dataframe with below conditions:
carat is equal to 0.26 and;
clarity is “VS2” and;
select columns “carat”, “cut”, “clarity”
then store it as sub_diamonds
What is the dimension of sub_diamonds? Check with dim(sub_diamonds)
Arrange, or sort the dataframe based on the specified column value
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.3 3.0 1.1 0.1 setosa
2 4.4 2.9 1.4 0.2 setosa
3 4.4 3.0 1.3 0.2 setosa
Arranging with multiple variables: hierarchical ordering
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.3 3.0 1.1 0.1 setosa
2 4.4 2.9 1.4 0.2 setosa
3 4.4 3.0 1.3 0.2 setosa
4 4.4 3.2 1.3 0.2 setosa
5 4.5 2.3 1.3 0.3 setosa
Caution
It is incorrect to chain arrange() calls:
iris |> arrange(Sepal.Length) |> arrange(Sepal.Width)
because the later call resets the ordering.
Compute transformations of variables and create new columns.
\[ \frac{X - \bar{X}}{\sigma_X} = \frac{X - \text{mean}(X)}{\text{sd}(X)} \]
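The standardization above can be sketched with mutate() (the new column name Sepal.Length.Std is an assumption):

```r
library(dplyr)

iris_std <- iris |>
  mutate(Sepal.Length.Std = (Sepal.Length - mean(Sepal.Length)) / sd(Sepal.Length))

head(iris_std$Sepal.Length.Std, 3)   # the standardized column has mean 0, sd 1
```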
Many times we are interested in getting summary statistics for groups.
Summarizing is also called data aggregation
Often used with group_by() , to generate summary
Example: What is the average of “Sepal.Length” by “Species” in iris dataset?
# A tibble: 9 × 5
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 7 3.2 4.7 1.4 versicolor
5 6.4 3.2 4.5 1.5 versicolor
6 6.9 3.1 4.9 1.5 versicolor
7 6.3 3.3 6 2.5 virginica
8 5.8 2.7 5.1 1.9 virginica
9 7.1 3 5.9 2.1 virginica
Above code shows the first 3 rows for each Species.
Example: What is the average of Sepal.Length by each group?
What if we want to summarize across every column?
# A tibble: 3 × 5
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.01 3.43 1.46 0.246
2 versicolor 5.94 2.77 4.26 1.33
3 virginica 6.59 2.97 5.55 2.03
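The table above can be produced with group_by() plus summarize(across()); a sketch:

```r
library(dplyr)

# Group means for every (non-grouping) column, one row per Species
iris_means <- iris |>
  group_by(Species) |>
  summarize(across(everything(), mean))

iris_means
```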
From diamonds dataframe:
Hint: use filter() and summarize()
Hint: use min() and max()
Hint: count rows with n()
Create a new column xyz, which is the multiplication of x, y, and z. Store the dataframe as my_diamond.
What is the maximum of the new column (max(my_diamond$xyz))?
Hint: use cor(x, y) for the correlation between x and y
Running repeated regressions and getting coefficients, by each group:
# A tibble: 3 × 3
# Groups: Species [3]
Species intercept slope
<fct> <dbl> <dbl>
1 setosa 2.64 0.690
2 versicolor 3.54 0.865
3 virginica 3.91 0.902
Hint: get the price data with the tidyquant package
| symbol | mean(volume) |
|---|---|
| AAPL | 98953731 |
| BAC | 52857474 |
| MSFT | 30654553 |
| TSLA | 133226380 |
Hint: filter() and apply year() and month() to the date variable
| symbol | mean(volume) |
|---|---|
| AAPL | 285457836 |
| BAC | 128477227 |
| MSFT | 73304341 |
| TSLA | 287001136 |
Real-world data does not come clean and ready for your analysis.
You will learn a consistent data structure, i.e. Tidy data.
Same data can be presented in various ways. The example data contains four pieces of information:
Data (table1, table2, table3) is available when you load tidyverse
| country | year | cases | population |
|---|---|---|---|
| Afghanistan | 1999 | 745 | 19987071 |
| Afghanistan | 2000 | 2666 | 20595360 |
| Brazil | 1999 | 37737 | 172006362 |
| Brazil | 2000 | 80488 | 174504898 |
| China | 1999 | 212258 | 1272915272 |
| China | 2000 | 213766 | 1280428583 |
| country | year | type | count |
|---|---|---|---|
| Afghanistan | 1999 | cases | 745 |
| Afghanistan | 1999 | population | 19987071 |
| Afghanistan | 2000 | cases | 2666 |
| Afghanistan | 2000 | population | 20595360 |
| Brazil | 1999 | cases | 37737 |
| Brazil | 1999 | population | 172006362 |
| Brazil | 2000 | cases | 80488 |
| Brazil | 2000 | population | 174504898 |
| China | 1999 | cases | 212258 |
| China | 1999 | population | 1272915272 |
| China | 2000 | cases | 213766 |
| China | 2000 | population | 1280428583 |
| country | year | rate |
|---|---|---|
| Afghanistan | 1999 | 745/19987071 |
| Afghanistan | 2000 | 2666/20595360 |
| Brazil | 1999 | 37737/172006362 |
| Brazil | 2000 | 80488/174504898 |
| China | 1999 | 212258/1272915272 |
| China | 2000 | 213766/1280428583 |
Which data structure is easier to work with, for general purpose?
A definition of tidy data:
When data is tidy, it is generally easier to work with the data.
If you want to generate a rate of TB per 10,000 population on table1:
# A tibble: 6 × 5
country year cases population rate
<chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071 0.373
2 Afghanistan 2000 2666 20595360 1.29
3 Brazil 1999 37737 172006362 2.19
4 Brazil 2000 80488 174504898 4.61
5 China 1999 212258 1272915272 1.67
6 China 2000 213766 1280428583 1.67
It is not simple to operate this with table2 setting.
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
Load package tidyverse.
table1 will be ready for access.
(Hint: group_by() and summarize())
For your analysis, you will need to pivot the data (aka reshape):
to longer form (fewer variables and more observations)
or to wider form (more columns and fewer rows)
Tip
Whenever you pivot the data, think about columns that are affected, and the names and values.
Let’s take a look at billboard dataset which is in wide form.
Each observation is a song, and we have 76 columns that describe rank.
To tidy up, we want “rank” variable in column that stores the number.
How can we pivot the data so that we have rank in one column?
Simple illustration on how pivot_longer() works:
| artist | track | date.entered | week | rank |
|---|---|---|---|---|
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk1 | 87 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk2 | 82 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk3 | 72 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk4 | 77 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk5 | 87 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk6 | 94 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk7 | 99 |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk8 | NA |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk9 | NA |
| 2 Pac | Baby Don't Cry (Keep... | 2000-02-26 | wk10 | NA |
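The table above is the kind of result pivot_longer() produces; a sketch on the built-in billboard data (the names billboard_longer, week, and rank follow the slides):

```r
library(tidyr)

billboard_longer <- billboard |>
  pivot_longer(
    cols = starts_with("wk"),   # the 76 week columns
    names_to = "week",          # old column names go into a "week" column
    values_to = "rank"          # cell values go into a "rank" column
  )

head(billboard_longer, 3)
```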
From the billboard data, how can you select artist, track, and the columns "wk1" to "wk12"?
Hint: use num_range() in select()
pivot_wider() works in the opposite way, in that it:
increases the number of columns (variables)
decreases the number of rows (observations)
Wider forms are common for machine learning.
Previous billboard_longer data:
# A tibble: 5 × 5
artist track date.entered week rank
<chr> <chr> <date> <chr> <dbl>
1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
Pivot the data to wider form:
# A tibble: 5 × 79
artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
<chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
3 3 Doors Do… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
4 3 Doors Do… Loser 2000-10-21 76 76 72 69 67 65 55 59
5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
# wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
# wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
# wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
# wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
# wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
# wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …
Note that pivot_wider() needs at least two inputs: names_from and values_from.
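The widening above can be sketched as follows (assuming billboard_longer was built with pivot_longer() first):

```r
library(tidyr)

billboard_longer <- billboard |>
  pivot_longer(starts_with("wk"), names_to = "week", values_to = "rank")

# names_from: where the new column names come from; values_from: the cell values
billboard_wide <- billboard_longer |>
  pivot_wider(names_from = week, values_from = rank)

dim(billboard_wide)   # back to the original 79 columns
```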
Stock price data commonly comes in the form below:
| symbol | date | open | high | low | close | volume | adjusted |
|---|---|---|---|---|---|---|---|
| TSLA | 2023-01-03 | 118.47 | 118.80 | 104.64 | 108.10 | 231402800 | 108.10 |
| TSLA | 2023-01-04 | 109.11 | 114.59 | 107.52 | 113.64 | 180389000 | 113.64 |
| TSLA | 2023-01-05 | 110.51 | 111.75 | 107.16 | 110.34 | 157986300 | 110.34 |
| TSLA | 2023-01-06 | 103.00 | 114.39 | 101.81 | 113.06 | 220911100 | 113.06 |
| TSLA | 2023-01-09 | 118.96 | 123.52 | 117.11 | 119.77 | 190284000 | 119.77 |
| TSLA | 2023-01-10 | 121.07 | 122.76 | 114.92 | 118.85 | 167642500 | 118.85 |
Question: What if you wanted to have a column for each stock’s adjusted price, like below?
| date | TSLA | BAC | XOM |
|---|---|---|---|
| 2023-01-03 | 108.10 | 30.97427 | 95.43420 |
| 2023-01-04 | 113.64 | 31.55659 | 95.71197 |
| 2023-01-05 | 110.34 | 31.49189 | 97.85343 |
| 2023-01-06 | 113.06 | 31.80617 | 99.03617 |
| 2023-01-09 | 119.77 | 31.32551 | 97.19038 |
| 2023-01-10 | 118.85 | 31.53811 | 98.64192 |
# A tibble: 250 × 4
date TSLA BAC XOM
<date> <dbl> <dbl> <dbl>
1 2023-01-03 108. 31.0 95.4
2 2023-01-04 114. 31.6 95.7
3 2023-01-05 110. 31.5 97.9
4 2023-01-06 113. 31.8 99.0
5 2023-01-09 120. 31.3 97.2
6 2023-01-10 119. 31.5 98.6
7 2023-01-11 123. 31.8 99.8
8 2023-01-12 124. 31.9 101.
9 2023-01-13 122. 32.6 101.
10 2023-01-17 131. 31.9 101.
# ℹ 240 more rows
table1 and table2 are available from tidyverse.
How can you transform table1 to table2?
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
Examine data population available from tidyverse.
Pivot the data to generate output as below.
(Hint: Browse document file ?pivot_wider() and see names_prefix argument.)
| country | year_1995 | year_1996 | year_1997 | year_1998 | year_1999 | year_2000 | year_2001 | year_2002 | year_2003 | year_2004 | year_2005 | year_2006 | year_2007 | year_2008 | year_2009 | year_2010 | year_2011 | year_2012 | year_2013 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 17586073 | 18415307 | 19021226 | 19496836 | 19987071 | 20595360 | 21347782 | 22202806 | 23116142 | 24018682 | 24860855 | 25631282 | 26349243 | 27032197 | 27708187 | 28397812 | 29105480 | 29824536 | 30551674 |
| Albania | 3357858 | 3341043 | 3331317 | 3325456 | 3317941 | 3304948 | 3286084 | 3263596 | 3239385 | 3216197 | 3196130 | 3179573 | 3166222 | 3156608 | 3151185 | 3150143 | 3153883 | 3162083 | 3173271 |
| Algeria | 29315463 | 29845208 | 30345466 | 30820435 | 31276295 | 31719449 | 32150198 | 32572977 | 33003442 | 33461345 | 33960903 | 34507214 | 35097043 | 35725377 | 36383302 | 37062820 | 37762962 | 38481705 | 39208194 |
| American Samoa | 52874 | 53926 | 54942 | 55899 | 56768 | 57522 | 58176 | 58729 | 59117 | 59262 | 59117 | 58652 | 57919 | 57053 | 56245 | 55636 | 55274 | 55128 | 55165 |
| Andorra | 63854 | 64274 | 64090 | 63799 | 64084 | 65399 | 68000 | 71639 | 75643 | 79060 | 81223 | 81877 | 81292 | 79969 | 78659 | 77907 | 77865 | 78360 | 79218 |
| Angola | 12104952 | 12451945 | 12791388 | 13137542 | 13510616 | 13924930 | 14385283 | 14886574 | 15421075 | 15976715 | 16544376 | 17122409 | 17712824 | 18314441 | 18926650 | 19549124 | 20180490 | 20820525 | 21471618 |
| Anguilla | 9807 | 10063 | 10305 | 10545 | 10797 | 11071 | 11371 | 11693 | 12023 | 12342 | 12637 | 12903 | 13145 | 13365 | 13571 | 13768 | 13956 | 14132 | 14300 |
| Antigua and Barbuda | 68349 | 70245 | 72232 | 74206 | 76041 | 77648 | 78972 | 80030 | 80904 | 81718 | 82565 | 83467 | 84397 | 85349 | 86300 | 87233 | 88152 | 89069 | 89985 |
| Argentina | 34833168 | 35264070 | 35690778 | 36109342 | 36514558 | 36903067 | 37273361 | 37627545 | 37970411 | 38308779 | 38647854 | 38988923 | 39331357 | 39676083 | 40023641 | 40374224 | 40728738 | 41086927 | 41446246 |
| Armenia | 3223173 | 3173425 | 3137652 | 3112958 | 3093820 | 3076098 | 3059960 | 3047002 | 3036032 | 3025652 | 3014917 | 3002911 | 2989882 | 2977488 | 2968154 | 2963496 | 2964120 | 2969081 | 2976566 |
Financial data encompasses information related to financial markets, instruments, and economic indicators. It includes:
The risk-return tradeoff is a fundamental concept in finance
Return Calculation for Individual Assets:
The holding period return of an asset (\(r\)) over a period is calculated using the formula:
\[ r = \frac{P_{end} - P_{begin} + D}{P_{begin}} \]
\(r\) is the rate of return for the holding period
\(P_{end}\) is the ending price of the asset
\(P_{begin}\) is the beginning price of the asset
\(D\) represents any dividends or income received during the period
Tip
The return on a portfolio is a weighted sum of the individual returns of the assets within the portfolio.
The return of a portfolio (\(r_p\)) is calculated as:
\[ r_p = \sum_{i=1}^{n} w_i r_i \]
\(r_p\) is the return of the portfolio over the period
\(w_i\) is the weight of asset \(i\) in the portfolio
\(r_i\) is the return of asset \(i\) over the period
\(n\) is the number of assets in the portfolio
Suppose you have a portfolio consisting of two assets, A and B. The expected annual return for asset A is 8%, and for asset B, it’s 12%.
If 60% of your portfolio is invested in asset A and 40% in asset B,
Calculate the expected annual return of the portfolio.
Stand-alone risk considers the risk of a single asset independently
The formula for risk (\(\sigma\)) is:
\[ \sigma = \sqrt{\frac{\sum (r_t - \bar{r})^2}{T - 1}} = sd(r) \]
Where:
Portfolio risk involves the risk associated with holding a portfolio of assets.
Suppose you have a portfolio consisting of two assets, A and B. The expected annual return for asset A is 8%, and for asset B, it’s 12%.
The annual standard deviation of returns for asset A is 10% (0.1), and for asset B, it’s 15% (0.15). The correlation coefficient between the returns of assets A and B is 0.5.
If 60% of your portfolio is invested in asset A and 40% in asset B,
Calculate the expected annual risk of the portfolio.
\[ \sigma_p^2 = w_A^2\sigma_A^2 + w_B^2\sigma_B^2 + 2 w_A w_B \sigma_A \sigma_B \rho_{AB} \]
At each time \(t\), calculate the return of a portfolio (\(r_p\))
\[ r_{p,t} = \sum_{i=1}^{n} w_{i,t} r_{i,t} \]
Then calculate standard deviation of portfolio returns (\(\sigma_p\)) over \(t\)
\[ \sigma_p = sd(r_p) = \sqrt{\frac{\sum (r_{p,t} - \bar{r_p})^2}{T - 1}} \]
# A tibble: 6 × 3
symbol date adjusted
<chr> <date> <dbl>
1 TSLA 2020-01-02 28.7
2 TSLA 2020-01-03 29.5
3 TSLA 2020-01-06 30.1
4 TSLA 2020-01-07 31.3
5 TSLA 2020-01-08 32.8
6 TSLA 2020-01-09 32.1
# A tibble: 6 × 3
# Groups: symbol [1]
symbol date daily_return
<chr> <date> <dbl>
1 BAC 2020-01-03 -0.0208
2 BAC 2020-01-06 -0.00143
3 BAC 2020-01-07 -0.00660
4 BAC 2020-01-08 0.0101
5 BAC 2020-01-09 0.00172
6 BAC 2020-01-10 -0.00828
We will use the pivot technique to deviate from the tidy form.
# A tibble: 6 × 4
date BAC TSLA XOM
<date> <dbl> <dbl> <dbl>
1 2020-01-03 -0.0208 0.0296 -0.00804
2 2020-01-06 -0.00143 0.0193 0.00768
3 2020-01-07 -0.00660 0.0388 -0.00818
4 2020-01-08 0.0101 0.0492 -0.0151
5 2020-01-09 0.00172 -0.0219 0.00766
6 2020-01-10 -0.00828 -0.00663 -0.00889
Drop rows with missing values
# A tibble: 6 × 5
date BAC TSLA XOM port_ret
<date> <dbl> <dbl> <dbl> <dbl>
1 2020-01-03 -0.0208 0.0296 -0.00804 0.000718
2 2020-01-06 -0.00143 0.0193 0.00768 0.00933
3 2020-01-07 -0.00660 0.0388 -0.00818 0.00623
4 2020-01-08 0.0101 0.0492 -0.0151 0.00924
5 2020-01-09 0.00172 -0.0219 0.00766 -0.00241
6 2020-01-10 -0.00828 -0.00663 -0.00889 -0.00809
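The port_ret column above is a weighted sum of each stock's return at each date; a toy sketch using two rows of values from the table (the portfolio weights here are an assumption for illustration):

```r
library(dplyr)

w <- c(BAC = 0.2, TSLA = 0.3, XOM = 0.5)   # assumed weights, summing to 1

returns_wide <- tibble::tibble(
  date = as.Date(c("2020-01-03", "2020-01-06")),
  BAC  = c(-0.0208, -0.00143),
  TSLA = c( 0.0296,  0.0193),
  XOM  = c(-0.00804, 0.00768)
)

port <- returns_wide |>
  mutate(port_ret = w["BAC"] * BAC + w["TSLA"] * TSLA + w["XOM"] * XOM)

sd(port$port_ret)   # stand-alone risk estimate of the portfolio (toy sample)
```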
Derived from portfolio theory and optimization.
\[ E(r_i)=r_f+β_i(E(r_m)−r_f)\]
Where:
\(E(r_i)\) is the expected return of the investment
\(r_f\) is the risk-free rate
\(\beta_i\) is the beta of the investment
\(E(r_m)\) is the expected return of the market
\((E(r_m) - r_f)\) is known as the market risk premium
Beta is a measure of the sensitivity of an individual investment’s returns to the market.
\[ \beta_i = \frac{Cov(r_i - r_f, r_m-r_f)}{Var(r_m-r_f)} \]
\[ r_i - r_f = \beta_0 + \beta_1(r_m - r_f)+ e_i \]
Tip
In some specific cases (e.g., intraday frequency) beta is estimated with raw return (\(r_i\) and \(r_m\)), not excess returns (\(r_i - r_f\))
Prepare stock returns and market (index) returns.
Then combine the two dataframes (bind them row-wise).
tickers <- c("MSFT")
stock_prices <- tq_get(tickers, from = '2020-01-01', to = '2023-12-31')
snp500 <- tq_get("SP500", get = "economic.data", from = "2020-01-01", to = "2023-12-31")
snp500 <- snp500 |> rename(adjusted = price) # rename column for binding
prices <- bind_rows(stock_prices, snp500)
prices |> glimpse()
Rows: 2,049
Columns: 8
$ symbol <chr> "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT…
$ date <date> 2020-01-02, 2020-01-03, 2020-01-06, 2020-01-07, 2020-01-08, …
$ open <dbl> 158.78, 158.32, 157.08, 159.32, 158.93, 161.84, 162.82, 161.7…
$ high <dbl> 160.73, 159.95, 159.10, 159.67, 160.80, 162.22, 163.22, 163.3…
$ low <dbl> 158.33, 158.06, 156.51, 157.32, 157.95, 161.03, 161.18, 161.2…
$ close <dbl> 160.62, 158.62, 159.03, 157.58, 160.09, 162.09, 161.34, 163.2…
$ volume <dbl> 22622100, 21116200, 20813700, 21634100, 27746500, 21385000, 2…
$ adjusted <dbl> 152.1584, 150.2638, 150.6521, 149.2785, 151.6563, 153.5510, 1…
Pivot price data into wide form:
# A tibble: 6 × 3
date MSFT SP500
<date> <dbl> <dbl>
1 2020-01-02 152. 3258.
2 2020-01-03 150. 3235.
3 2020-01-06 151. 3246.
4 2020-01-07 149. 3237.
5 2020-01-08 152. 3253.
6 2020-01-09 154. 3275.
Generate stock and index returns
Prepare risk-free rate (use 3 month treasury bill rate)
The risk-free rate is in percentage term (%) and annualized.
Convert the rate into daily level
# A tibble: 6 × 4
symbol date price rf
<chr> <date> <dbl> <dbl>
1 DGS3MO 2020-01-01 NA NA
2 DGS3MO 2020-01-02 1.54 0.0000611
3 DGS3MO 2020-01-03 1.52 0.0000603
4 DGS3MO 2020-01-06 1.56 0.0000619
5 DGS3MO 2020-01-07 1.54 0.0000611
6 DGS3MO 2020-01-08 1.54 0.0000611
Convert to wide form:
# A tibble: 6 × 2
date DGS3MO
<date> <dbl>
1 2020-01-01 NA
2 2020-01-02 0.0000611
3 2020-01-03 0.0000603
4 2020-01-06 0.0000619
5 2020-01-07 0.0000611
6 2020-01-08 0.0000611
Join the two dataframes column-wise
# A tibble: 3 × 4
date MSFT SP500 DGS3MO
<date> <dbl> <dbl> <dbl>
1 2020-01-02 NA NA 0.0000611
2 2020-01-03 -0.0125 -0.00706 0.0000603
3 2020-01-06 0.00258 0.00353 0.0000619
Then generate excess returns:
Use lm() for linear regression fit.
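A self-contained sketch of the lm() call; the column names MSFT_exret and Mkt_exret follow the slides, but the data here is simulated as a stand-in for the real capm_data, so the estimates will not match the slide output exactly:

```r
set.seed(42)
# Simulate daily market excess returns and a stock with true beta 1.17
mkt <- rnorm(1000, mean = 0, sd = 0.012)
capm_data <- data.frame(
  Mkt_exret  = mkt,
  MSFT_exret = 0.0005 + 1.17 * mkt + rnorm(1000, mean = 0, sd = 0.0115)
)

capm_fit <- lm(MSFT_exret ~ Mkt_exret, data = capm_data)
coef(capm_fit)                 # intercept and slope (the beta)
coef(capm_fit)["Mkt_exret"]    # extract the beta estimate by name
```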
Call:
lm(formula = MSFT_exret ~ Mkt_exret, data = capm_data)
Coefficients:
(Intercept) Mkt_exret
0.0005149 1.1718521
To browse summary of regression:
Call:
lm(formula = MSFT_exret ~ Mkt_exret, data = capm_data)
Residuals:
Min 1Q Median 3Q Max
-0.068986 -0.006335 -0.000456 0.006333 0.076456
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0005149 0.0003655 1.409 0.159
Mkt_exret 1.1718521 0.0251585 46.579 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.01154 on 996 degrees of freedom
(45 observations deleted due to missingness)
Multiple R-squared: 0.6854, Adjusted R-squared: 0.6851
F-statistic: 2170 on 1 and 996 DF, p-value: < 2.2e-16
How to access and extract coefficient estimates?
(Intercept) Mkt_exret
0.0005148829 1.1718520584
[1] "double"
To extract beta estimate:
A CAPM estimate (or any other estimate) depends heavily on the timeframe you choose.
Consider the following:
Should they be similar? Usually not.
Using our previous example, estimate the CAPM beta with the following:
Given a set “lookback” period, the estimate should use only previous information.
To capture timely information at each period, rolling regressions are often performed.
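A hedged sketch of a rolling regression in base R (packages like zoo or slider offer ready-made rolling helpers, but a plain loop shows the idea): at each date, refit the model on only the previous `window` observations, so there is no look-ahead.

```r
# Rolling beta: at position i, regress y on x using the past `window` points
rolling_beta <- function(y, x, window = 252) {
  n <- length(y)
  betas <- rep(NA_real_, n)
  for (i in window:n) {
    idx <- (i - window + 1):i            # only past `window` observations
    betas[i] <- coef(lm(y[idx] ~ x[idx]))[2]
  }
  betas
}

# With an exact linear relationship, the rolling beta is constant
x <- rnorm(300)
y <- 1.5 * x
b <- rolling_beta(y, x, window = 60)
```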
Generate stock prices with the code below.
Using adjusted daily closing prices, calculate the expected (average) return and risk of:
an equal-weighted portfolio
30% on Apple and 70% on Microsoft
70% on Apple and 30% on Microsoft
Based on the portfolio return (70% on Apple and 30% on Microsoft), calculate the portfolio beta.
Importing data is a crucial initial step in any data science project.
We will learn how to read external data into R as a data.frame object.
We will also learn how to write an R data.frame to a local file.
CSV: Comma-Separated Values
A plain, human-readable text data file
Minimalistic, widely used
Typically opened with Excel, but it is not an Excel file!
Since it is text, R tries to “guess” the type of each column when importing
A csv representation of iris dataframe:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
When saved to csv file:
Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5,3.6,1.4,0.2,setosa
Write and read csv
Many packages support writing/reading CSV files:
utils package (base R): basic but slow
readr from tidyverse: extremely fast, functional
We use the readr package (automatically loaded with tidyverse).
To write a data.frame to a csv file: write_csv()
To read a .csv file to a data.frame: read_csv()
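The two steps can be sketched as a round trip through a temporary file (a minimal example, assuming readr is installed):

```r
library(readr)

# Round-trip a data frame through a temporary CSV file
path <- file.path(tempdir(), "iris.csv")
write_csv(iris, path)
iris_back <- read_csv(path, show_col_types = FALSE)
```

Note that column types are re-guessed on read: the factor Species comes back as a character column, as the output below shows.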
# A tibble: 6 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
You can specify a downloadable URL instead:
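A sketch of both forms, assuming readr; the URL shown is hypothetical, and the literal-text form wrapped in I() is handy for quick checks without a network connection:

```r
library(readr)

# read_csv() accepts a URL string directly (hypothetical URL):
# auto <- read_csv("https://example.com/auto.csv")

# It also reads literal text wrapped in I():
small <- read_csv(I("x,y\n1,2\n3,4"), show_col_types = FALSE)
```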
# A tibble: 6 × 9
mpg cylinders displacement horsepower weight acceleration model_year origin
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 18 8 307 130 3504 12 70 usa
2 15 8 350 165 3693 11.5 70 usa
3 18 8 318 150 3436 11 70 usa
4 16 8 304 150 3433 12 70 usa
5 17 8 302 140 3449 10.5 70 usa
6 15 8 429 198 4341 10 70 usa
# ℹ 1 more variable: name <chr>
To read a .xlsx file to a data.frame: read_excel() from readxl package
# A tibble: 6 × 11
`in million USD` `FY '09` `FY '10` `FY '11` `FY '12` `FY '13` `FY '14`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Net Income/Starting Line 19658 31398 42206 47681 33448 33615
2 Depreciation & Amortiza… 11917 14760 15583 15888 17182 17297
3 Non-Cash Items -1474 129 262 3313 482 3110
4 Change in Working Capit… -1663 2126 -2706 -10712 -6198 -8906
5 Cash from Operating Act… 28438 48413 55345 56170 44914 45116
6 Change in Fixed Assets … -22491 -26871 -30975 -34271 -33669 -32952
# ℹ 4 more variables: `FY '15` <dbl>, `FY '16` <dbl>, `FY '17` <dbl>,
# `FY '18` <dbl>
There are other common data formats:
The janitor package offers a simple variable-name cleaner: clean_names().
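A minimal sketch, assuming janitor is installed, on a toy frame with messy names like the Excel headers above:

```r
library(janitor)

# Messy names like the Excel headers above
df <- data.frame(`in million USD` = "Net Income", `FY '09` = 19658,
                 check.names = FALSE)

# clean_names() converts to snake_case, dropping spaces and punctuation
df_clean <- clean_names(df)
names(df_clean)
```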
# A tibble: 6 × 11
in_million_usd fy_09 fy_10 fy_11 fy_12 fy_13 fy_14 fy_15 fy_16 fy_17
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Net Income/Sta… 19658 31398 42206 47681 33448 33615 16551 8375 19848
2 Depreciation &… 11917 14760 15583 15888 17182 17297 18048 22308 19893
3 Non-Cash Items -1474 129 262 3313 482 3110 -750 -5313 -7921
4 Change in Work… -1663 2126 -2706 -10712 -6198 -8906 -3505 -3288 -1754
5 Cash from Oper… 28438 48413 55345 56170 44914 45116 30344 22082 30066
6 Change in Fixe… -22491 -26871 -30975 -34271 -33669 -32952 -26490 -16163 -15402
# ℹ 1 more variable: fy_18 <dbl>
Exercise, using janitor and readxl:
Read the Excel file with read_excel() and store as exxon_statement.
Clean the variable names with clean_names().
Read the exxon_categories.csv file with read_csv() as exxon_categories.
Join exxon_categories to exxon_statement.
Rename in_million_usd to Account.
Pivot to long form as exxon_long with the below arguments: starts_with()
Categorical variables (factors) take a predefined set of values.
To visualize the distribution of a categorical variable, bar plots are often used.
geom_bar() and geom_col()
geom_bar(): when you need the count of a single category
geom_col(): when you need to supply the Y values yourself
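The distinction can be sketched on ggplot2's built-in mpg dataset: geom_bar() counts rows per category for you, while geom_col() takes Y values you have already computed.

```r
library(ggplot2)

# geom_bar(): counts occurrences of each category for you
p_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# geom_col(): you supply the Y values yourself
counts <- as.data.frame(table(mpg$class))
p_col <- ggplot(counts, aes(x = Var1, y = Freq)) +
  geom_col()
```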
The end goal of the bar plot:
To fill the bars with color, use “fill”. For the border color, use “color”.
To change scientific notation to currency format: use scales package.
To add more breaks (ticks) on Y axis, control n.breaks in continuous scale.
To add axis labels and plot title, use labs() function.
total_amt_bycat |>
ggplot(aes(x = Category, y = Total_USD)) +
geom_col(fill = "lightblue", color = "black") +
scale_y_continuous(
n.breaks = 6,
labels = scales::dollar_format(prefix = "$", suffix = "T", scale = 0.001)
) +
labs(
title = "Exxon Mobil Total Revenue / Expenditure",
subtitle = "From Years 2009 - 2018",
x = "",
y = "Total Revenue / Expenditure ($)",
caption = "Source: Exxon Financial Statements"
)
By default, ggplot uses a gray background. To change it, you can use other built-in themes.
total_amt_bycat |>
ggplot(aes(x = Category, y = Total_USD)) +
geom_col(fill = "lightblue", color = "black") +
scale_y_continuous(
n.breaks = 6,
labels = scales::dollar_format(prefix = "$", suffix = "T", scale = 0.001)
) +
labs(
title = "Exxon Mobil Total Revenue / Expenditure",
subtitle = "From Years 2009 - 2018",
x = "",
y = "Total Revenue / Expenditure ($)",
caption = "Source: Exxon Financial Statements"
) +
theme_bw()
To reorder a factor variable according to its value, use fct_reorder() on the variable.
total_amt_bycat |>
ggplot(aes(x = fct_reorder(Category, -Total_USD) , y = Total_USD)) +
geom_col(fill = "lightblue", color = "black") +
scale_y_continuous(
n.breaks = 6,
labels = scales::dollar_format(prefix = "$", suffix = "T", scale = 0.001)
) +
labs(
title = "Exxon Mobil Total Expenditure",
subtitle = "From Years 2009 - 2018",
x = "",
y = "Total Expenditure ($)",
caption = "Source: Exxon Financial Statements"
) +
theme_bw()
A numeric variable can take an infinite range of values.
A commonly used visualization is the histogram.
stock_returns <- tq_get(c("AAPL", "KO"), from = "2020-01-01", to = "2023-12-31") |>
group_by(symbol) |>
arrange(symbol, date) |>
mutate(pct_ret = (adjusted / lag(adjusted) - 1) * 100) |>
ungroup() |>
select(symbol, date, pct_ret)
stock_returns |>
filter(symbol == "AAPL") |>
ggplot(aes(x = pct_ret)) +
geom_histogram() # default bins: 30
An alternative for visualizing a numeric distribution is the density plot.
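A density plot can be sketched with geom_density(); a minimal example using simulated returns in place of stock_returns:

```r
library(ggplot2)

# Simulated daily percentage returns standing in for stock_returns
set.seed(42)
toy <- data.frame(pct_ret = rnorm(1000, mean = 0, sd = 2))

# Smooth alternative to the histogram
p <- ggplot(toy, aes(x = pct_ret)) +
  geom_density(fill = "lightblue", alpha = 0.5)
```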
Visualizing two (or more) variables can be useful to show variable relationships.
Barplots can be used to visualize two categorical variables.
Scatterplots and regressions can effectively visualize relationships.
Time series plots have a date variable on the X axis.
Line plots are often used.
Candlestick price charts are often used as well:
# Plot candlestick chart
stock_price <- tq_get("AAPL", from = "2023-12-01", to = "2023-12-31")
stock_price |>
ggplot(aes(x = date, open = open, high = high, low = low, close = close)) +
geom_candlestick() +
labs(
title = "AAPL Candlestick Chart, December 2023",
x = "Date",
y = "Price (USD)") +
theme_bw()
stock_price |>
ggplot(aes(x = as.factor(date), open = open, high = high, low = low, close = close)) +
geom_candlestick() +
labs(
title = "AAPL Candlestick Chart, December 2023",
x = "Date",
y = "Price (USD)") +
theme_bw() +
scale_x_discrete(
breaks = as.factor(stock_price$date[seq(1, length(stock_price$date), 3)])
)
FIN6776: Big Data and Machine Learning in Finance